Project: Investigate a Dataset - TMDB 5000 Movie Dataset¶

Selman Karaosmanoglu¶

Table of Contents¶

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions

Introduction¶

Dataset Description¶

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Each entry contains the following information about a film:

  • imdb_id: The unique identifier of the movie on the Internet Movie Database (IMDb).
  • popularity: A measure of how popular the movie is.
  • budget: The amount of money spent to produce the movie.
  • revenue: The amount of money earned by the movie.
  • original_title: The title of the movie as it was originally released, in its original language.
  • cast: A list of the main actors and actresses who starred in the movie, separated by pipe characters(|).
  • homepage: The URL of the official website of the movie.
  • director: The name of the person who directed the movie.
  • tagline: A short and catchy phrase that summarizes the main theme
  • keywords: A list of words or phrases related to the movie separated by pipe characters(|).
  • overview: A brief summary of the movie’s story.
  • runtime: The duration of the movie, in minutes.
  • genres: A list of the categories that the movie belongs to, such as comedy, drama, horror, etc., separated by pipe characters(|).
  • production_companies: A list of the companies that were involved in the production of the movie, separated by pipe characters(|).
  • release_date: The date when the movie was first released to the public, in the format MM-DD-YY.
  • vote_count: The number of votes that the movie has received from users.
  • vote_average: The average rating that the movie has received from users on a scale from 0 to 10.
  • release_year: The year when the movie was first released in the format YYYY.
  • budget_adj: The amount of money spent to produce the movie, adjusted for inflation, in 2010.
  • revenue_adj: The amount of money earned by the movie from ticket sales, adjusted for inflation, in 2010.

Question(s) for Analysis¶

  • Research Question 1 - What are the top 5 cast members by frequency in the dataset?
  • Research Question 2 - What are the most and least frequent genres of movies in the dataset?
  • Research Question 3 - Which genres are most popular from year to year?
  • Research Question 4 - How many movies were released in each year in the dataset?
  • Research Question 5 - What is the median popularity of movies by genres?
  • Research Question 6 - Is there a statistically significant difference between median of the movie's popularity across genres?
  • Research Question 7 - Does the runtime of a movie have a different impact on its popularity depending on its genre?
In [1]:
!python --version
Python 3.11.7
In [2]:
# Install libraries
# !pip install scikit-posthocs
In [3]:
# Import libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
import plotly.express as px
import ipywidgets as widgets
from IPython.display import display
import scikit_posthocs as sp
import plotly.io as pio
pio.renderers.default='notebook'

%matplotlib inline
In [4]:
# Display all columns in the dataframe
pd.set_option('display.max_columns', None)

# Set the display format for floats
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set the figure size
plt.rcParams["figure.figsize"] = (10, 6)

# Set style for plots
sns.set_style("whitegrid")

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")
In [5]:
pd.__version__
Out[5]:
'2.1.4'

Data Wrangling¶

In [6]:
# Load csv
df = pd.read_csv('tmdb-movies.csv')
In [7]:
df.head()
Out[7]:
id imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.986 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. monster|dna|tyrannosaurus rex|velociraptor|island Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.500 2015 137999939.280 1392445892.524
1 76341 tt1392190 28.420 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. future|chase|post-apocalyptic|dystopia|australia An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.100 2015 137999939.280 348161292.489
2 262500 tt2908446 13.113 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You based on novel|revolution|dystopia|sequel|dyst... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.300 2015 101199955.472 271619025.408
3 140607 tt2488496 11.173 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. android|spaceship|jedi|space opera|3d Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.500 2015 183999919.040 1902723129.802
4 168259 tt2820852 9.335 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home car race|speed|revenge|suspense|car Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.300 2015 174799923.088 1385748801.471
In [8]:
df.shape
Out[8]:
(10866, 21)
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [10]:
# First look - Descriptive statistics
df.describe().T
Out[10]:
count mean std min 25% 50% 75% max
id 10866.000 66064.177 92130.137 5.000 10596.250 20669.000 75610.000 417859.000
popularity 10866.000 0.646 1.000 0.000 0.208 0.384 0.714 32.986
budget 10866.000 14625701.094 30913213.831 0.000 0.000 0.000 15000000.000 425000000.000
revenue 10866.000 39823319.793 117003486.582 0.000 0.000 0.000 24000000.000 2781505847.000
runtime 10866.000 102.071 31.381 0.000 90.000 99.000 111.000 900.000
vote_count 10866.000 217.390 575.619 10.000 17.000 38.000 145.750 9767.000
vote_average 10866.000 5.975 0.935 1.500 5.400 6.000 6.600 9.200
release_year 10866.000 2001.323 12.813 1960.000 1995.000 2006.000 2011.000 2015.000
budget_adj 10866.000 17551039.823 34306155.723 0.000 0.000 0.000 20853251.084 425000000.000
revenue_adj 10866.000 51364363.253 144632485.040 0.000 0.000 0.000 33697095.717 2827123750.412

Data Cleaning¶

User-defined functions¶

In [11]:
def replace0(df:pd.DataFrame,feature:str):
    """Replace 0s with NaNs in a dataframe column and returns the zero counts."""
    before = df.query(f'{feature} == 0').shape[0]
    df[feature].replace({0: np.nan}, inplace=True)
    after = df.query(f'{feature} == 0').shape[0]
    return f"the number of zeros in {feature} column : {before} -> {after}"

1. Drop unused columns¶

Drop unused columns that is not related to research questions.

  • 'budget' and 'revenue' features exist in adjusted form in the dataset.
  • 'imdb_id', 'homepage', 'tagline', 'overview', 'keywords' features are not related to research questions.
In [12]:
# Display a sample
df.sample(1)
Out[12]:
id imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
938 254024 tt2967008 0.409 0 0 The Face of an Angel Kate Beckinsale|Cara Delevingne|Daniel Brühl|... NaN Michael Winterbottom Forget the truth, find the story. murder investigation Both a journalist and a documentary filmmaker ... 100 Thriller|Drama Revolution Films|BBC Films|Cattleya|Vedette Fi... 9/6/14 28 4.200 2014 0.000 0.000
In [13]:
# Drop columns
drop_column_list = ['imdb_id','budget','revenue','homepage','tagline','overview','keywords']
display(df.shape)
df.drop(columns=drop_column_list,inplace=True)
display(df.shape)
(10866, 21)
(10866, 14)
In [14]:
# Display a sample
df.sample(1)
Out[14]:
id popularity original_title cast director runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
8048 12633 0.558 The Star Chamber Michael Douglas|Hal Holbrook|Yaphet Kotto|Shar... Peter Hyams 109 Action|Crime|Drama|Thriller Twentieth Century Fox Film Corporation|Frank Y... 8/5/83 15 5.100 1983 0.000 0.000

2. Remove duplicated columns¶

In [15]:
# Check if the Dataset have any Duplicate

df.duplicated().value_counts()
Out[15]:
False    10865
True         1
Name: count, dtype: int64
In [16]:
# Drop Duplicates

df.drop_duplicates(inplace=True)
In [17]:
# Display shape

df.shape
Out[17]:
(10865, 14)

3. Check missing values¶

In [18]:
# Display descriptive statistics
df.describe().T
Out[18]:
count mean std min 25% 50% 75% max
id 10865.000 66066.374 92134.092 5.000 10596.000 20662.000 75612.000 417859.000
popularity 10865.000 0.646 1.000 0.000 0.208 0.384 0.714 32.986
runtime 10865.000 102.072 31.383 0.000 90.000 99.000 111.000 900.000
vote_count 10865.000 217.400 575.645 10.000 17.000 38.000 146.000 9767.000
vote_average 10865.000 5.975 0.935 1.500 5.400 6.000 6.600 9.200
release_year 10865.000 2001.322 12.813 1960.000 1995.000 2006.000 2011.000 2015.000
budget_adj 10865.000 17549894.037 34307526.658 0.000 0.000 0.000 20853251.084 425000000.000
revenue_adj 10865.000 51369001.759 144638333.129 0.000 0.000 0.000 33701729.009 2827123750.412

In descriptive statistics of the dataset, it is seen that there are 0 values in runtime, budget_adj and revenue_adj. 0 values in these features will be considered as null values.

Popularity feature may have been calculated as 0.

In [19]:
# replace 0 with np.nan in the runtime column
replace0(df,'runtime')
Out[19]:
'the number of zeros in runtime column : 31 -> 0'
In [20]:
# replace 0 with np.nan in the budget_adj
replace0(df,'budget_adj')
Out[20]:
'the number of zeros in budget_adj column : 5696 -> 0'
In [21]:
# replace 0 with np.nan in the revenue_adj
replace0(df,'revenue_adj')
Out[21]:
'the number of zeros in revenue_adj column : 6016 -> 0'
In [22]:
# Check the sum of Missing Values per column

df.isnull().sum()
Out[22]:
id                         0
popularity                 0
original_title             0
cast                      76
director                  44
runtime                   31
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj              5696
revenue_adj             6016
dtype: int64
In [23]:
# Check the Percentage of Missing Values

df.isnull().sum() / df.shape[0] * 100
Out[23]:
id                      0.000
popularity              0.000
original_title          0.000
cast                    0.699
director                0.405
runtime                 0.285
genres                  0.212
production_companies    9.480
release_date            0.000
vote_count              0.000
vote_average            0.000
release_year            0.000
budget_adj             52.425
revenue_adj            55.370
dtype: float64
In [24]:
# Plot the Percentage of Missing Values

ax = (df.isnull().sum() / df.shape[0] * 100).plot(kind='barh')
plt.bar_label(ax.containers[0], fmt='%.2f%%');
No description has been provided for this image

4. Convert to datetime format¶

We need to convert release_date format to pd dateformat in order to make analysis using pandas.

In [25]:
# Display a sample
df.release_date.sample(1)
Out[25]:
31    3/4/15
Name: release_date, dtype: object
In [26]:
# Convert str to date format
df['release_date'] = pd.to_datetime(df['release_date'])
In [27]:
# Control a sample
df.release_date.loc[10561]
Out[27]:
Timestamp('1986-09-23 00:00:00')
In [28]:
# List of columns which data types are object
df.select_dtypes(include=["object"]).columns
Out[28]:
Index(['original_title', 'cast', 'director', 'genres', 'production_companies'], dtype='object')
In [29]:
# Descriptive statistics of object columns
df.select_dtypes(include=["object"]).describe().T
Out[29]:
count unique top freq
original_title 10865 10571 Hamlet 4
cast 10789 10719 Louis C.K. 6
director 10821 5067 Woody Allen 45
genres 10842 2039 Comedy 712
production_companies 9835 7445 Paramount Pictures 156
In [30]:
# make a copy
df0 = df.copy()

Exploratory Data Analysis¶

User-defined functions¶

In [31]:
def explode(df_:pd.DataFrame,feature:str):
    """Split and expand a column of delimited values into multiple rows."""
    df_temp = df_.copy()
    df_temp[feature] = df_temp[feature].str.split('|')
    return df_temp.explode(feature)

Research Question 1 - What are the top 5 cast members by frequency in the dataset?¶

In [32]:
# Get one sample
df.sample(1)
Out[32]:
id popularity original_title cast director runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
6500 292 0.151 Dave Chappelle's Block Party Dave Chappelle|Erykah Badu|Common|Mos Def|Bilal Michel Gondry 100.000 Comedy|Documentary|Music Yari Film Group 2005-09-12 15 7.000 2005 3349689.869 13084552.983
In [33]:
# Explode dataframe to have each genre as a seperate row

df_cast_explode = explode(df,'cast')
df_cast_explode.head(3)
Out[33]:
id popularity original_title cast director runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 32.986 Jurassic World Chris Pratt Colin Trevorrow 124.000 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.500 2015 137999939.280 1392445892.524
0 135397 32.986 Jurassic World Bryce Dallas Howard Colin Trevorrow 124.000 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.500 2015 137999939.280 1392445892.524
0 135397 32.986 Jurassic World Irrfan Khan Colin Trevorrow 124.000 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.500 2015 137999939.280 1392445892.524
In [34]:
# the number of null values

df_cast_explode.cast.isna().sum()
Out[34]:
76
In [35]:
# Drop null values

display(df_cast_explode.shape)
df_cast_explode.dropna(subset=['cast'], inplace = True)
display(df_cast_explode.shape)
(52644, 14)
(52568, 14)
In [36]:
# the number of duplicated records

df_cast_explode.duplicated().sum()
Out[36]:
19
In [37]:
# Drop duplicates

df_cast_explode.drop_duplicates(inplace=True)
display(df_cast_explode.shape)
(52549, 14)
In [38]:
# the unique number of cast members in the dataset

df_cast_explode.cast.nunique()
Out[38]:
19026

There are 19026 different cast in the dataset

In [39]:
# Top 5 cast members

cast_counts = df_cast_explode['cast'].value_counts()[:5]
cast_counts
Out[39]:
cast
Robert De Niro       72
Samuel L. Jackson    71
Bruce Willis         62
Nicolas Cage         61
Michael Caine        53
Name: count, dtype: int64
In [40]:
# Visualization of top 5 members

ax = sns.barplot(x=cast_counts.index, y=cast_counts.values, palette = 'Blues_r')
plt.xlabel('Cast')
plt.ylabel('Count')
plt.title('Top 5 cast members by frequency')
plt.xticks(rotation=45)
for container in ax.containers:
    plt.bar_label(container);
No description has been provided for this image

Conclusion: The top 5 cast members in the dataset by frequency are Robert De Niro(72), Samuel L. Jackson(71), Bruce Willis(62), Nicolas Cage(61) and Michael Caine(53)

Research Question 2 - What are the most and least frequent genres of movies in the dataset?¶

In [41]:
# Explode dataframe to have each genre as a seperate row
df_genres_explode = explode(df,'genres')
df_genres_explode.head(3)
Out[41]:
id popularity original_title cast director runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 32.986 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124.000 Action Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.500 2015 137999939.280 1392445892.524
0 135397 32.986 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124.000 Adventure Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.500 2015 137999939.280 1392445892.524
0 135397 32.986 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow 124.000 Science Fiction Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.500 2015 137999939.280 1392445892.524
In [42]:
# Look at its shape
df_genres_explode.shape
Out[42]:
(26978, 14)
In [43]:
# Genres in the dataset
df_genres_explode.genres.unique()
Out[43]:
array(['Action', 'Adventure', 'Science Fiction', 'Thriller', 'Fantasy',
       'Crime', 'Western', 'Drama', 'Family', 'Animation', 'Comedy',
       'Mystery', 'Romance', 'War', 'History', 'Music', 'Horror',
       'Documentary', 'TV Movie', nan, 'Foreign'], dtype=object)
In [44]:
# the number of null values in genres
df_genres_explode.genres.isna().sum()
Out[44]:
23
In [45]:
# Drop nulls if exist in genres column
display(df_genres_explode.shape)
df_genres_explode.dropna(subset=['genres'], inplace = True)
display(df_genres_explode.shape)
(26978, 14)
(26955, 14)
In [46]:
# Genres count in the dataset
df_genres_explode.groupby('genres').count()[['id']]
Out[46]:
id
genres
Action 2384
Adventure 1471
Animation 699
Comedy 3793
Crime 1354
Documentary 520
Drama 4760
Family 1231
Fantasy 916
Foreign 188
History 334
Horror 1637
Music 408
Mystery 810
Romance 1712
Science Fiction 1229
TV Movie 167
Thriller 2907
War 270
Western 165
In [47]:
# Genres count in the dataset sorted descending
df_genres_explode['genres'].value_counts()
Out[47]:
genres
Drama              4760
Comedy             3793
Thriller           2907
Action             2384
Romance            1712
Horror             1637
Adventure          1471
Crime              1354
Family             1231
Science Fiction    1229
Fantasy             916
Mystery             810
Animation           699
Documentary         520
Music               408
History             334
War                 270
Foreign             188
TV Movie            167
Western             165
Name: count, dtype: int64
In [48]:
# the visualization of genres count in the dataset sorted descending
ax = sns.countplot(x='genres',data=df_genres_explode, order=df_genres_explode['genres'].value_counts().index, palette = 'Blues_r')
plt.xticks(rotation=45)
for container in ax.containers:
    plt.bar_label(container);
No description has been provided for this image

Conclusion: the most frequent genre is Drama(4760), the lease frequent genre is Western(165).

Research Question 3 - Which genres are most popular from year to year?¶

In [49]:
# Genres popularity mean by year
df_genres_explode.groupby(['release_year','genres'])[['popularity']].mean().reset_index().head(20)
Out[49]:
release_year genres popularity
0 1960 Action 0.591
1 1960 Adventure 0.701
2 1960 Comedy 0.396
3 1960 Crime 0.346
4 1960 Drama 0.566
5 1960 Family 0.278
6 1960 Fantasy 0.428
7 1960 Foreign 0.195
8 1960 History 0.412
9 1960 Horror 0.591
10 1960 Music 0.424
11 1960 Romance 0.541
12 1960 Science Fiction 0.328
13 1960 Thriller 0.812
14 1960 War 0.226
15 1960 Western 0.568
16 1961 Action 0.366
17 1961 Adventure 0.767
18 1961 Animation 2.632
19 1961 Comedy 0.680
In [50]:
# Line plot of genres popularity mean by year 
df2 = df_genres_explode.groupby(['release_year','genres'])[['popularity']].mean().reset_index()
ax = sns.lineplot(x='release_year',y='popularity',hue='genres',data=df2)
sns.move_legend(ax,'upper left', bbox_to_anchor=(1,1))
No description has been provided for this image

Due to the high number of unique genres, a separate graph will be drawn for each genre, after which an interactive version will be built to better understand the data.

In [51]:
g = sns.FacetGrid(df_genres_explode, col="genres", col_wrap=5, height=3, aspect=1.3)
g.map(sns.lineplot, "release_year", "popularity")
g.set_titles("{col_name}")
g.tight_layout()
Out[51]:
<seaborn.axisgrid.FacetGrid at 0x23f2ef8e7d0>
No description has been provided for this image

Genres popularity from year to year.

In [52]:
# Create an interactive line plot with plotly express
fig = px.line(df2, x='release_year', y='popularity', color='genres', width= 1000, height = 600)

# Show the interactive plot
fig.show()

Interactive plot of genres popularity from year to year.

In [53]:
# Select the most popular genre of the years
idx = df2.groupby('release_year')['popularity'].idxmax()
In [54]:
# Select those rows from the original dataframe
most_popular = df2.loc[idx]

# What are the most popular genres each year?
most_popular
Out[54]:
release_year genres popularity
13 1960 Thriller 0.812
18 1961 Animation 2.632
34 1962 Adventure 0.943
52 1963 Animation 2.180
83 1964 War 0.931
95 1965 Music 0.969
105 1966 Animation 0.586
123 1967 Animation 1.349
149 1968 Mystery 1.519
159 1969 Crime 0.948
175 1970 Animation 1.128
199 1971 Family 1.531
215 1972 Crime 1.073
229 1973 Animation 0.957
257 1974 Mystery 0.702
265 1975 Adventure 0.880
285 1976 Crime 0.707
301 1977 Action 1.419
331 1978 Music 0.680
339 1979 Action 1.410
368 1980 Science Fiction 0.897
374 1981 Adventure 0.876
409 1982 War 1.143
412 1983 Adventure 0.901
437 1984 Family 0.824
455 1985 Family 0.924
467 1986 Adventure 0.799
491 1987 History 0.816
501 1988 Action 0.599
522 1989 Animation 1.178
540 1990 Adventure 0.802
561 1991 Animation 1.665
580 1992 Animation 1.287
605 1993 Fantasy 0.919
620 1994 Crime 1.298
638 1995 Animation 1.468
659 1996 Crime 0.977
689 1997 Science Fiction 1.140
711 1998 War 1.247
714 1999 Adventure 1.012
734 2000 Adventure 0.855
761 2001 Fantasy 1.565
781 2002 Fantasy 1.430
801 2003 Fantasy 1.748
821 2004 Fantasy 1.321
841 2005 Fantasy 1.118
861 2006 Fantasy 1.023
881 2007 Fantasy 0.957
894 2008 Adventure 1.008
914 2009 Adventure 1.138
933 2010 Adventure 1.360
971 2011 Western 1.176
991 2012 Western 1.733
993 2013 Adventure 1.261
1012 2014 Adventure 2.431
1031 2015 Adventure 3.284

Conclusion: The most popular genre in the last three years is adventure. the most popular genre was fantasy from 2001 to 2007.

Research Question 4 - How many movies were released in each year in the dataset?¶

In [55]:
# Get the number of movies by grouping according to release year.
df_year = df.groupby('release_year').size()
df_year.head(3)
Out[55]:
release_year
1960    32
1961    31
1962    32
dtype: int64
In [56]:
# Line plot of number of movies across years.
sns.lineplot(x=df_year.index,y=df_year.values, label= 'Number of movies', color='green')
plt.xlabel("Release year")
plt.ylabel("Number of movies")
plt.title("Number of movies by release year")
plt.show()
No description has been provided for this image
In [57]:
# https://plotly.com/python/line-charts/

fig = px.line(df_year, x=df_year.index, y=df_year.values, labels={'y':'Number of movies'}, title='Number of movies by release year', width= 1000, height=600)
fig.update_layout(hovermode="x")
fig.show()

Conclusion: The number of movies increased gradually over the years.

Research Question 5 - What is the median popularity of movies by genres?¶

In [58]:
# The histogram of popularity
sns.histplot(x='popularity',kde=True, data=df, bins=20);
No description has been provided for this image
In [59]:
# Descriptive statistics of popularity
df.describe()[['popularity']].T
Out[59]:
count mean min 25% 50% 75% max std
popularity 10865.000 0.646 0.000 0.208 0.384 0.714 32.986 1.000
In [60]:
# The boxplot of popularity
sns.boxplot(x='popularity', data=df)
Out[60]:
<Axes: xlabel='popularity'>
No description has been provided for this image
In [61]:
# Histograms of popularity by genre
plt.figure(figsize=(20,15))
for count,genre in enumerate(df_genres_explode.genres.unique()):
    plt.subplot(5,5,count + 1)
    sns.histplot(x='popularity',kde=True, data=df_genres_explode.query('genres == @genre'), bins=20)
    plt.title(genre)
plt.tight_layout()
No description has been provided for this image
In [62]:
# Genre popularity skewness
for count,genre in enumerate(df_genres_explode.genres.unique()):
    print(f"{genre}: {df_genres_explode.query('genres == @genre').popularity.skew()}")
Action: 8.519559576213021
Adventure: 7.544246925782778
Science Fiction: 8.346574445070894
Thriller: 12.31808653463584
Fantasy: 3.3442974777303105
Crime: 4.141872354110435
Western: 6.173147812882716
Drama: 9.26647507842431
Family: 3.399000948901938
Animation: 3.2776512678707967
Comedy: 3.85839662551183
Mystery: 4.033738855312595
Romance: 3.0004699872067215
War: 4.07009677907743
History: 5.559047399136068
Music: 3.404349210716752
Horror: 5.209936328110234
Documentary: 1.7119359791142117
TV Movie: 2.8382606971691686
Foreign: 1.1906528356317898

All genre popularity skewness are right-skewed in the dataset. In this case looking at the medians of the popularity is more meaningful than looking at the means, as the medians are less affected by outliers.

In [63]:
# Looking at the mean and the median side by side
df_genres_explode.groupby('genres')[['popularity']].agg(['mean','median'])
Out[63]:
popularity
mean median
genres
Action 0.926 0.470
Adventure 1.154 0.561
Animation 0.852 0.501
Comedy 0.593 0.389
Crime 0.745 0.476
Documentary 0.181 0.142
Drama 0.591 0.389
Family 0.787 0.470
Fantasy 0.993 0.530
Foreign 0.191 0.165
History 0.576 0.413
Horror 0.465 0.322
Music 0.487 0.322
Mystery 0.690 0.428
Romance 0.592 0.404
Science Fiction 1.002 0.440
TV Movie 0.271 0.231
Thriller 0.742 0.444
War 0.728 0.432
Western 0.591 0.328
In [64]:
# Median popularity of genres
df_median_popularity = df_genres_explode.groupby('genres')['popularity'].agg(['median'])
df_median_popularity
Out[64]:
median
genres
Action 0.470
Adventure 0.561
Animation 0.501
Comedy 0.389
Crime 0.476
Documentary 0.142
Drama 0.389
Family 0.470
Fantasy 0.530
Foreign 0.165
History 0.413
Horror 0.322
Music 0.322
Mystery 0.428
Romance 0.404
Science Fiction 0.440
TV Movie 0.231
Thriller 0.444
War 0.432
Western 0.328

Conclusion: Popularity skewness are right skewed across genres.

Research Question 6 - Is there a statistically significant difference between median of the movie's popularity across genres?¶

https://www.statology.org/kruskal-wallis-test-python/

In [65]:
# Make genre groups to be calculated in the statistical test
genre_groups = [group['popularity'].values for name,group in df_genres_explode.groupby('genres')]
In [66]:
# Apply Kruskal Wallis test to genre groups
statistic, p_value = stats.kruskal(*genre_groups)
In [67]:
# display Kruskal Wallis test results
print(f'Kruskal-Wallis Statistic: {statistic}')
print(f'p-value: {p_value}')
Kruskal-Wallis Statistic: 1586.9103452441982
p-value: 0.0
In [68]:
# Perform hyptoheses test
if p_value < 0.05:
    print('There is a statistically significant difference in median popularity across genres')
else:
    print('There is no statistically significant difference in median popularity across genres')
There is a statistically significant difference in median popularity across genres
In [69]:
# https://www.statology.org/dunns-test-python/
# To find out which groups have different medians after a Kruskal-Wallis test, we can use Dunn’s Test for pairwise comparisons.
posthoc_result = sp.posthoc_dunn(df_genres_explode,val_col='popularity',group_col='genres',p_adjust='bonferroni')
In [70]:
# dataframe that shows True if the difference is statistically significant, shows False if there is not statistically significant difference between medians of the pairs.
posthoc_result_p = posthoc_result < 0.05
In [71]:
# Heatmap of genre pairs showing whether there is a statistically significant difference between pair medians.

plt.figure(figsize=(20, 20)) 
sns.heatmap(posthoc_result_p, annot=True, cmap='Blues', fmt="", cbar=False) 
plt.xlabel("Genres")
plt.ylabel("Genres")
plt.title("Post-hoc Dunn's test results");
# True means there is statistically significant difference between median popularity of the two genres.
# False means  there is no statistically significant difference between median popularity of the two genres.
No description has been provided for this image
In [72]:
# Creating a new df to be used in plotly interactive hover template
# Create an empty numpy array with the same shape as df_median_popularity
new_array = np.empty((len(df_median_popularity), len(df_median_popularity)))
new_array = new_array.astype(str)
In [73]:
# Fill the new array with the value of the pairs
for i, x in enumerate(df_median_popularity.index):
    for j, y in enumerate(df_median_popularity.index):
        new_array[i][j] = f'({x}:{df_median_popularity.loc[x][0]} {y}:{df_median_popularity.loc[y][0]})'
In [74]:
# Convert the new array to a pandas dataframe
new_df = pd.DataFrame(new_array, index=df_median_popularity.index, columns=df_median_popularity.index)
In [75]:
# Display a sample
new_df.sample(1).iloc[0,0]
Out[75]:
'(Documentary:0.1422399999999995 '
In [76]:
# Interactive heatmap showing median pairwise comparisons.

fig = px.imshow(posthoc_result < 0.05, color_continuous_scale='Blues')
fig.update_xaxes(side="top")
fig.update_layout(title="Post-hoc Dunn's test results", title_x=0.5, width=1000, height=1000)
fig.update_coloraxes(showscale=False)
fig.update_traces(text=new_df, texttemplate="%{text}")
fig.update_traces(hovertemplate='%{x} - %{y} <br> %{text}')
fig.show()
# Blue color means there is statistically significant difference between median popularity of the two genres.
# White color means  there is no statistically significant difference between median popularity of the two genres.

Conclusion: There is a statistically significant difference in median popularity across genres.

Research Question 7 - Does the runtime of a movie have a different impact on its popularity depending on its genre?¶

In [77]:
# First look at the runtime-popularity scatterplot across all movies

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="runtime", y="popularity")
plt.title(f"All Movies")
plt.xlabel("Runtime")
plt.ylabel("Popularity")
plt.show();
No description has been provided for this image
In [78]:
# Display the runtime-popularity scatterplot - genres breakdown

plt.figure(figsize=(8, 6))
g = sns.FacetGrid(df_genres_explode, col="genres", col_wrap=5, height=3, aspect=1.3)
g.map(sns.scatterplot, "runtime", "popularity")
g.set_titles("{col_name}")
g.tight_layout();
<Figure size 800x600 with 0 Axes>
No description has been provided for this image
In [79]:
# Make it interactive. Create a dropdown widget for genre selection

genre_dropdown = widgets.Dropdown(
    options=df_genres_explode["genres"].unique(),
    description="Select Genre:",
    disabled=False,
)

def plot_scatterplot(genre):
    plt.figure(figsize=(8, 6))
    sns.scatterplot(data=df_genres_explode[df_genres_explode["genres"] == genre],
                    x="runtime", y="popularity")
    plt.title(f"{genre} Movies")
    plt.xlabel("Runtime")
    plt.ylabel("Popularity")
    plt.xlim(0, df_genres_explode["runtime"].max())
    plt.ylim(0, df_genres_explode["popularity"].max())
    plt.show()

# Display the dropdown and plot the scatterplot
widgets.interactive(plot_scatterplot, genre=genre_dropdown)
Out[79]:

Conclusion: According to the plots, it seems that the length of a film affects its popularity. It should be between 80 and 160. People do not seem to like watching longer films, especially over 200 minutes. There seem to be slight differences between genres.

Conclusions¶

  • The top 5 cast members in the dataset by frequency are Robert De Niro(72), Samuel L. Jackson(71), Bruce Willis(62), Nicolas Cage(61) and Michael Caine(53)

  • the most frequent genre is Drama(4760), the lease frequent genre is Western(165) in the dataset.

  • The most popular genre in the last three years is adventure. the most popular genre was fantasy from 2001 to 2007.

  • The number of movies increased gradually over the years.

  • Popularity skewness are right skewed across genres.

  • There is a statistically significant difference in median popularity across genres.

  • According to the plots, it seems that the length of a film affects its popularity. It should be between 80 and 160. People do not seem to like watching longer films, especially over 200 minutes. There seem to be slight differences between genres.

Limitations¶

  • The data may not be accurate or updated.

  • The budget and revenue columns do not specify the currency unit, which may vary depending on the country of production.

  • The Kruskal-Wallis test requires similar sample sizes, but some variation is acceptable. (https://www.statisticssolutions.com/kruskal-wallis-test/)

In [80]:
# !jupyter nbconvert --to html Investigate_a_Dataset.ipynb
In [81]:
# Embed widgets
# !jupyter nbconvert --to html --execute Investigate_a_Dataset.ipynb
In [82]:
# Save requirements.txt file
# !pigar generate